Practising

Things I'm going to mention

  • Testing
  • Continuous Integration
  • Static Code Analysis
  • Notebooks (R and Jupyter)

Are you talking to me?

Why should I listen?

"data science should follow the rules of good software engineering!" - (Grus 2018)

The game is Hypothesis testing

  • Reproducibility (Peng 2011) - ensures the test conditions can be recreated
  • 10 Simple Rules (Sandve et al. 2013)
  • Results of data collection/analysis used to test hypothesis
    • using several different methodologies (Drummond 2017)
  • I'm not talking about p-hacking (Simmons, Nelson, and Simonsohn 2011; Nuzzo 2014)
  • Excel deserves a dishonourable mention as the cause of widespread gene name errors in the scientific literature (Ziemann, Eren, and El-Osta 2016)

Collaboration, Usage

Open source software is now the basis of data and analytical science (Ince, Hatton, and Graham-Cumming 2012)

  • Other users need to trust the code
  • They must check it applies to their own systems and data
  • Installation
  • Contribution model
  • Exemplar usage
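
In R, these needs are commonly met with usethis scaffolding helpers; a minimal sketch (run from the package directory; exact arguments vary by usethis version):

```r
## Sketch: scaffolding that supports collaboration
usethis::use_readme_rmd()        # exemplar usage in a rendered README
usethis::use_vignette("intro")   # a longer worked example
usethis::use_mit_license()       # clear terms for reuse and contribution
```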

How - R development tools

library(devtools)
library(usethis)
library(testthat)

Creating a new package

pkg_name <- 'TestUtils'
usethis::create_package(path = pkg_name,
                        open = FALSE,
                        fields = list(Description = "TestUtils package",
                                      Title       = "Provides an example of testing",
                                      Package     = pkg_name))

Enabling testing

setwd(pkg_name)
usethis::use_testthat()
## ✔ Setting active project to '/Users/hrards/code/DataScienceWorkshop/Workshop2018/presentations/day-2/09_15_Testing_Hypotheses/TestUtils'
## ✔ Creating 'tests/testthat/'
## ✔ Writing 'tests/testthat.R'

Anatomy of a test suite

tree --charset=ascii TestUtils/tests 
## TestUtils/tests
## |-- testthat
## |   |-- helper-load-data.R
## |   |-- helper-mock-functions.R
## |   |-- setup.R
## |   |-- test-complex.R
## |   |-- test-regression.R
## |   `-- test-simple.R
## `-- testthat.R
## 
## 1 directory, 7 files

Anatomy of a test file

cat TestUtils/tests/testthat/test-simple.R
## context("Simple Test")
## 
## test_that("Simple Description", {
##   ## simple test case
##   expect_true(TRUE, label = "TRUE really is TRUE")
##   ## data vs expectation
##   data <- 1:100
##   expect_equal(data, 1:100, label = "data matches sequence")
## })

Simple

context("Simple Test")
test_that("Simple Description", {
  ## simple test case
  expect_true(TRUE, label = "TRUE really is TRUE")
  ## data vs expectation
  data <- 1:100
  expect_equal(data, 1:100, label = "data matches sequence")
})

Failing Tests

context("Simple Test")
test_that("Simple Description", {
  ## simple test case that will fail
  expect_true(FALSE, label = "TRUE really is TRUE")
})

Enabling Continuous Integration

setwd(pkg_name)
usethis::use_travis()
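
use_travis() writes a .travis.yml at the package root; for R packages a minimal config typically looks like this (illustrative; the generated file may differ by usethis version):

```yaml
# Minimal Travis CI config for an R package
language: r
cache: packages
```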

Enabling Coverage Metrics

setwd(pkg_name)
usethis::use_coverage(type = "coveralls")
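
Coverage can also be inspected locally with the covr package (assumes covr is installed):

```r
## Compute test coverage locally with covr
cov <- covr::package_coverage(path = pkg_name)
covr::percent_coverage(cov)  # overall percentage covered by tests
covr::report(cov)            # interactive HTML coverage report
```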

Library Example

Notebooks

  • Notebooks are fun / useful
  • Notebooks have problems (Grus 2018)
    • order of execution
    • hidden state
    • how do you test / know code is correct
    • dependencies
    • not a text editor
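
The hidden-state and order-of-execution problems are easy to reproduce; a sketch using two hypothetical notebook cells, written here as plain R:

```r
## Cell 1: define a value
x <- 10
## Cell 2: derive from it
y <- x * 2

## Now delete Cell 1. The notebook no longer shows where x came from,
## but x still exists in the running session, so re-running Cell 2
## succeeds. A fresh top-to-bottom run would instead fail with
## "object 'x' not found".
```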

Jupyter vs RStudio


R Notebooks and testing

internal_package <- "TestUtils"
devtools::load_all(path = internal_package)
## Loading code
devtools::test(pkg = internal_package)
## Loading code
## Testing code
## ✔ |  OK F W S | Context
## ✔ |   1       | Loading more
## ✔ |   0     1 | Loading more
## ──────────────────────────────────────────────────────────────────────
## test-regression.R:3: skip: More complex
## Empty test
## ──────────────────────────────────────────────────────────────────────
## ✔ |   2       | Simple Test
## 
## ══ Results ═══════════════════════════════════════════════════════════
## Duration: 0.1 s
## 
## OK:       3
## Failed:   0
## Warnings: 0
## Skipped:  1

References

Drummond, Chris. 2017. “Reproducible Research: A Minority Opinion.” Journal of Experimental & Theoretical Artificial Intelligence 30 (1). Informa UK Limited: 1–11. doi:10.1080/0952813x.2017.1413140.

Grus, Joel. 2018. “I Don’t Like Notebooks - Joel Grus - #Jupytercon 2018 - Google Slides.” Allen Institute for Artificial Intelligence. Accessed September 6. https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1.

Ince, Darrel C., Leslie Hatton, and John Graham-Cumming. 2012. “The Case for Open Computer Programs.” Nature 482 (7386). Springer Nature: 485–88. doi:10.1038/nature10836.

Nuzzo, Regina. 2014. “Scientific Method: Statistical Errors.” Nature 506 (7487). Springer Nature: 150–52. doi:10.1038/506150a.

Peng, R. D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060). American Association for the Advancement of Science (AAAS): 1226–7. doi:10.1126/science.1213847.

Perkel, Jeffrey. 2018. “When It Comes to Reproducible Science, Git Is Code for Success | Nature Index.” Springer Nature Limited. Accessed September 4. https://www.natureindex.com/news-blog/when-it-comes-to-reproducible-science-git-is-code-for-success.

Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” Edited by Philip E. Bourne. PLoS Computational Biology 9 (10). Public Library of Science (PLoS): e1003285. doi:10.1371/journal.pcbi.1003285.

Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology.” Psychological Science 22 (11). SAGE Publications: 1359–66. doi:10.1177/0956797611417632.

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1). Springer Nature. doi:10.1186/s13059-016-1044-7.

www.plantandfood.co.nz